Exploiting Parallel Corpora for Supervised Word-Sense Disambiguation in English-Hungarian Machine Translation
Abstract
This paper presents an experiment in automatically generating annotated training corpora for a supervised word sense disambiguation (WSD) module operating in an English-Hungarian and a Hungarian-English machine translation system. Training examples for the WSD module are produced by annotating ambiguous lexical items in the source language (words with several possible translations) with their correct target-language translations. Because manually annotating training examples is very costly, we are experimenting with a method that extracts examples automatically from parallel corpora. In addition to statistical methods, our algorithm relies on monolingual and bilingual lexicons and dictionaries to annotate examples extracted from a large English-Hungarian parallel corpus that is accurately aligned at the sentence level. We present an experiment with the English noun "state", in which we categorized its occurrences in the Hunglish parallel corpus. The experiment showed that 93% of all corpus occurrences of "state" formed multiword lexemes with unambiguous Hungarian translations, so these can be omitted from the training data. The remaining 7% of occurrences are still sufficient for producing training data.
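The labelling idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the candidate lexicon, the multiword list, the prefix-based matching of Hungarian word forms, and the function names are all hypothetical simplifications chosen for illustration. The sketch skips sentence pairs in which "state" is part of a listed multiword lexeme and otherwise labels the occurrence with the single candidate translation found in the aligned Hungarian sentence.

```python
import re

# Hypothetical toy lexicon: candidate Hungarian translation stems for "state".
CANDIDATES = {"state": ["állam", "állapot"]}

# Hypothetical multiword lexemes whose translation is assumed unambiguous,
# so they are left out of the WSD training data.
MULTIWORD = ["united states", "state of the art"]


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def in_multiword(en_sentence, multiwords):
    """True if the sentence contains any listed multiword lexeme.

    Simplification: the whole sentence is skipped, even if it also
    contains a free-standing occurrence of the target word.
    """
    text = " " + " ".join(tokenize(en_sentence)) + " "
    return any(" " + mw + " " in text for mw in multiwords)


def label_pair(en_sentence, hu_sentence, word="state"):
    """Return the translation label for `word`, or None if the pair is skipped."""
    en_tokens = tokenize(en_sentence)
    if word not in en_tokens and word + "s" not in en_tokens:
        return None  # target word does not occur
    if in_multiword(en_sentence, MULTIWORD):
        return None  # multiword lexeme: omitted from the training data
    hu_tokens = tokenize(hu_sentence)
    # Crude stand-in for morphological analysis: prefix match on the stem.
    matches = [
        stem
        for stem in CANDIDATES[word]
        if any(tok.startswith(stem) for tok in hu_tokens)
    ]
    # Keep the example only when exactly one candidate translation is found.
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    # Invented example sentence pairs, not taken from the Hunglish corpus.
    pairs = [
        ("The state supports the schools.", "Az állam támogatja az iskolákat."),
        ("The patient is in a stable state.", "A beteg stabil állapotban van."),
        ("He lives in the United States.", "Az Egyesült Államokban él."),
    ]
    for en, hu in pairs:
        print(en, "->", label_pair(en, hu))
```

A realistic pipeline would replace the prefix match with proper Hungarian morphological analysis and word alignment; the sketch only shows how an aligned sentence pair can supply a sense label without manual annotation.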
Similar resources
Exploiting Parallel Texts to Produce a Multilingual Sense Tagged Corpus for Word Sense Disambiguation
We describe an approach to the automatic creation of a sense tagged corpus intended to train a word sense disambiguation (WSD) system for English-Portuguese machine translation. The approach uses parallel corpora, translation dictionaries and a set of straightforward heuristics. In an evaluation with nine corpora containing 10 ambiguous verbs, the approach achieved an average precision of 94%, ...
Word Sense Disambiguation Using Automatically Translated Sense Examples
We present an unsupervised approach to Word Sense Disambiguation (WSD). We automatically acquire English sense examples using an English-Chinese bilingual dictionary, Chinese monolingual corpora and Chinese-English machine translation software. We then train machine learning classifiers on these sense examples and test them on two gold standard English WSD datasets, one for binary and the other...
Using Parallel Corpora for Word Sense Disambiguation
Word Sense Disambiguation (WSD) is the Natural Language Processing (NLP) task that consists in selecting the correct sense of a polysemous word in a given context. Most state-of-the-art WSD systems are supervised classifiers that are trained on manually sense-tagged corpora, which are very time-consuming and expensive to build. In order to overcome this acquisition bottleneck (sense-tagged corp...
Exploiting Parallel Texts for Word Sense Disambiguation: An Empirical Study
A central problem of word sense disambiguation (WSD) is the lack of manually sense-tagged data required for supervised learning. In this paper, we evaluate an approach to automatically acquire sense-tagged training data from English-Chinese parallel corpora, which are then used for disambiguating the nouns in the SENSEVAL-2 English lexical sample task. Our investigation reveals that this method ...
Corpora based Approach for Arabic/English Word Translation Disambiguation
We present a word sense disambiguation method applied to the automatic translation of a query from Arabic into English. The developed machine learning approach is based on statistical models that can learn from parallel corpora by analysing the relations between the items in these corpora in order to use them in the word sense disambiguation task. The relations between items in this...